Abstract



The Lasso is a popular model selection and estimation procedure for linear models that enjoys nice 
theoretical properties. In this paper, we study the Lasso estimator for fitting autoregressive time series 
models. We adopt a double asymptotic framework where the maximal lag may increase with the sample 
size. We derive theoretical results establishing various types of consistency. In particular, we derive 
conditions under which the Lasso estimator for the autoregressive coefficients is model selection consistent, 
estimation consistent and prediction consistent. Simulation study results are reported. 

1 Introduction 

Classical stationary time series modeling assumes that data are a realization of a mix of autoregressive 
processes and moving average processes, or an ARMA model (see, e.g. Davis and Brockwell, 1991). Typically, 
both estimation and model fitting rely on the assumption of fixed and low dimensional parameters and 
include (i) the estimation of the appropriate coefficients under the somewhat unrealistic assumption that the 
orders of the AR and of the MA processes are known in advance, or (ii) some model selection procedures 
that sequentially fit models of increasing dimensions. In practice, however, it is very difficult to verify 
the assumption that the realized series does come from an ARMA process. Instead, it is usually assumed 
that the given data are a realization of a linear time series, which may be represented by an infinite-order 
autoregressive process. Some work has been done on the accuracy of an AR approximation for these
processes: see Shibata (1980), Goldenshluger and Zeevi (2001) and Ing and Wei (2005). In particular,
Goldenshluger and Zeevi (2001) propose a nonparametric minimax approach and assess the accuracy of a
finite order AR process in terms of both estimation and prediction.

This paper is concerned with fitting autoregressive time series models with the Lasso. The Lasso
procedure, proposed originally by Tibshirani (1996), is one of the most popular approaches for model selection
in linear and generalized linear models, and has been studied in much of the recent literature; see, e.g., Fan
and Peng (2004), Zhao and Bin (2006), Zou (2006), Wainwright (2006), Lafferty et al. (2007), and Nardi and
Rinaldo (2008), to mention just a few. The Lasso procedure has the advantage of simultaneously performing
model selection and estimation, and has been shown to be effective even in high dimensional settings where
the dimension of the parameter space grows with the sample size n. In the context of autoregressive
modeling, these features become especially advantageous, as both the AR order and the corresponding
AR coefficients can be estimated simultaneously. Wang et al. (2007) study linear regression with
autoregressive errors. They adapt the Lasso procedure to shrink both the regression coefficients and the
autoregressive coefficients, under the assumption that the autoregressive order is fixed.

For the autoregressive models we consider in this work, the number of parameters, or equivalently, the 
maximal possible lag, grows with the sample size. We refer to this scheme as a double asymptotic framework. 
The double asymptotic framework enables us to treat the autoregressive order as virtually infinite. The
autoregressive time series with an increasing number of parameters lies between a fixed order AR time series
and an infinite-order AR time series. This limiting process belongs to a family which is known to contain
many ARMA processes (see Goldenshluger and Zeevi, 2001). In this paper we show that the Lasso procedure
is particularly well suited to this double asymptotic scheme.

*Email: yuval@stat.cmu.edu
†Email: arinaldo@stat.cmu.edu

The rest of the paper is organized as follows. The next section formulates the autoregressive modeling 
scheme and defines the Lasso estimator associated with it. Asymptotic properties of the Lasso estimator 
are presented in Section 3. These include model selection consistency (Theorem 3.1), estimation consistency 
(Theorem 3.2), and prediction consistency (Corollary 3.4). Proofs are deferred to Section 6. A simulation 
study, given in Section 4, accompanies the theoretical results. Discussion and concluding remarks appear in
Section 5. 



2 Penalized autoregressive modeling 

In this section we describe our settings and set up the notation. 

We assume that $X_1, \ldots, X_n$ are $n$ observations from an AR(p) process:

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t, \qquad t = 1, \ldots, n, \tag{1}$$

where $\{Z_t\}$ is a random sequence of independent Gaussian variables with $\mathbb{E} Z_t = 0$, $\mathbb{E}|Z_t|^2 = \sigma^2$ and
$\mathrm{cov}(Z_t, X_s) = 0$ for all $s < t$. The last requirement is standard, and relies on the reasoning that
the process $\{X_t\}$ does not depend on future values of the driving Gaussian noise. The assumption about
Gaussianity of $\{Z_t\}$ is by no means necessary, and can be relaxed. It does, however, facilitate our theoretical
investigation and the presentation of various results, and therefore, it is in effect throughout the article. In
Section 5 we comment on how to modify our assumptions and proofs to allow for non-Gaussian innovations.

We further assume that $\{X_t\}$ is causal, meaning that there exists a sequence of constants $\{\psi_j\}$,
$j = 0, 1, \ldots$, with absolutely convergent series, $\sum_{j=0}^{\infty} |\psi_j| < \infty$, such that $\{X_t\}$ has an MA($\infty$) representation:

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}, \tag{2}$$

the series being absolutely convergent with probability one. Equivalently, we could stipulate that $\{X_t\}$
is purely non-deterministic, and then obtain representation (2), with $\psi_0 = 1$ and $\sum_{j=0}^{\infty} \psi_j^2 < \infty$, directly
from the Wold decomposition (see, e.g. Davis and Brockwell, 1991). A necessary and sufficient condition
for causality is that $1 - \phi_1 z - \cdots - \phi_p z^p \neq 0$ for all complex $z$ within the unit disc, $|z| \le 1$. Notice that
causality of $\{X_t\}$, and Gaussianity of $\{Z_t\}$, together imply Gaussianity of $\{X_t\}$. This follows from the fact
that mean square limits of Gaussian random variables are again Gaussian. The mean and variance of $X_t$
are given, respectively, by $\mathbb{E} X_t = 0$, $\mathbb{E}|X_t|^2 = \sigma^2 \sum_{j=0}^{\infty} \psi_j^2$. We assume, for simplicity, and without any loss
of generality, that $\mathbb{E}|X_t|^2 = 1$, so that $\sum_{j=0}^{\infty} \psi_j^2 = \sigma^{-2}$. Let $\gamma(\cdot)$ be the autocovariance function given by
$\gamma(k) = \mathbb{E} X_t X_{t+k}$, and let $\Gamma_p = (\gamma(i - j))_{i,j=1}^{p}$ be the $p \times p$ autocovariance matrix, of lags smaller than or equal
to $p - 1$.
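To make representation (2) and the normalization $\mathbb{E}|X_t|^2 = 1$ concrete, here is a minimal Python sketch that computes truncated $\psi_j$ weights by the usual recursion and, from them, approximate autocovariances. The function names, the truncation length `n_terms`, and the use of the simple sufficient condition $\sum_j |\phi_j| < 1$ for causality (rather than locating the roots of the AR polynomial) are assumptions made for this illustration only.

```python
def psi_weights(phi, n_terms):
    """First n_terms MA(infinity) weights: psi_0 = 1, psi_j = sum_k phi_k * psi_{j-k}."""
    psi = [1.0]
    for j in range(1, n_terms):
        psi.append(sum(phi[k - 1] * psi[j - k]
                       for k in range(1, min(j, len(phi)) + 1)))
    return psi


def autocovariance(phi, sigma2, max_lag, n_terms=2000):
    """Approximate gamma(k) = sigma^2 * sum_j psi_j * psi_{j+k}, truncated at n_terms."""
    psi = psi_weights(phi, n_terms + max_lag)
    return [sigma2 * sum(psi[j] * psi[j + k] for j in range(n_terms))
            for k in range(max_lag + 1)]


def unit_variance_sigma2(phi, n_terms=2000):
    """Innovation variance that gives E|X_t|^2 = 1, i.e. sigma^2 = 1 / sum_j psi_j^2."""
    psi = psi_weights(phi, n_terms)
    return 1.0 / sum(w * w for w in psi)


def is_causal_sufficient(phi):
    """Sufficient (not necessary) check: sum |phi_j| < 1 rules out roots with |z| <= 1."""
    return sum(abs(c) for c in phi) < 1.0
```

For an AR(1) process with $\phi_1 = 0.5$, the weights are $\psi_j = 0.5^j$, so `unit_variance_sigma2([0.5])` is close to $0.75$. The truncation is only reasonable when the weights decay quickly relative to `n_terms`.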

We now describe the penalized $\ell_1$ least squares estimator of the AR coefficients. Let $y = (X_1, \ldots, X_n)'$,
$\phi = (\phi_1, \ldots, \phi_p)'$, and $Z = (Z_1, \ldots, Z_n)'$, where apostrophe denotes transpose. Define the $n \times p$ matrix $\mathbf{X}$
with entry $X_{t-j}$ in the $t$th row and $j$th column, for $t = 1, \ldots, n$ and $j = 1, \ldots, p$. The Lasso-type estimator
$\hat\phi_n = \hat\phi_n(\Lambda_n)$ is defined to be the minimizer of:

$$\frac{1}{2n} \|y - \mathbf{X}\phi\|^2 + \lambda_n \sum_{j=1}^{p} \lambda_{n,j} |\phi_j|, \tag{3}$$

where $\Lambda_n = \{\lambda_n, \{\lambda_{n,j}, j = 1, \ldots, p\}\}$ are tuning parameters, and $\|\cdot\|$ denotes the $\ell_2$-norm. Here, $\lambda_n$
is a grand tuning parameter, while the $\{\lambda_{n,j}, j = 1, \ldots, p\}$ are specific tuning parameters associated with
predictors $X_{t-j}$. The Lasso solution (3) will be sparse, as some of the autoregressive coefficients will be
set to (exactly) zero, depending on the choice of tuning parameters $\Lambda_n$. Naturally, one may want to further
impose that $\lambda_{n,j} < \lambda_{n,k}$ for lag values satisfying $j < k$, to encourage even sparser solutions, although this is
not assumed throughout. The idea of using $\ell_1$ regularization to penalize the model parameters differently,
as we do in (3), was originally proposed by Zou (2006) under the name of adaptive Lasso. As shown in
Zou (2006), from an algorithmic point of view, the solution to our adaptive Lasso (3) can be obtained by a
slightly modified version of the LARS algorithm of Efron et al. (2004). A possible choice for $\lambda_{n,j}$ would be
to use the inverse least squares estimates, as in Zou (2006), but this is not pursued here.
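The objective (3) can be minimized numerically in several ways. The sketch below uses cyclic coordinate descent with soft-thresholding, which is a different technique from the modified LARS algorithm mentioned above and is swapped in here only because it is short. The helper names (`lagged_design`, `soft_threshold`, `weighted_lasso_cd`), the fixed iteration count, and the choice to drop the first $p$ observations when building the lagged design (instead of using pre-sample values) are all assumptions of this example.

```python
def lagged_design(x, p):
    """Response y_t = x[t] and rows (x[t-1], ..., x[t-p]) for t = p, ..., len(x)-1."""
    y = list(x[p:])
    X = [[x[t - j] for j in range(1, p + 1)] for t in range(p, len(x))]
    return X, y


def soft_threshold(a, b):
    """Soft-thresholding operator: sign(a) * max(|a| - b, 0)."""
    if a > b:
        return a - b
    if a < -b:
        return a + b
    return 0.0


def weighted_lasso_cd(X, y, lam, weights, n_iter=200):
    """Coordinate descent for (1/2n)||y - X phi||^2 + lam * sum_j weights[j] * |phi_j|."""
    n, p = len(y), len(X[0])
    phi = [0.0] * p
    resid = list(y)  # residual y - X phi, kept up to date
    col_sq = [sum(X[t][j] ** 2 for t in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # (1/n) * x_j' (partial residual that excludes coordinate j)
            rho = sum(X[t][j] * resid[t] for t in range(n)) / n + col_sq[j] * phi[j]
            new = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            if new != phi[j]:
                delta = new - phi[j]
                for t in range(n):
                    resid[t] -= delta * X[t][j]
                phi[j] = new
    return phi
```

Setting all entries of `weights` to one gives the ordinary Lasso with a single penalty `lam`. A fixed number of sweeps is used for simplicity; a convergence tolerance would be a natural refinement.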

As mentioned before, we consider a double asymptotic framework, in which the number of parameters
$p = p_n$ grows with $n$ at a certain rate. Clearly, the "large p small n" ($p \gg n$) scenario, which is an important
subject of many current articles, is not adequate here. Indeed, one might be suspicious about the
statistical properties of the proposed estimator even when $p$ is comparable with $n$ ($p < n$, but close to $n$).
Accounting for the mechanism of the autoregressive process, one is led to think that $p$ should grow with
$n$ at a much slower rate. This article shows that the choice $p = O(\log n)$ leads to nice asymptotic
properties of the proposed procedure (3). Such a choice of the AR order arises also in Goldenshluger and
Zeevi (2001), who prove minimax optimality for a different regularized least squares estimator. Moreover, as
pointed out in Goldenshluger and Zeevi (2001), the same order of $p$ arises also in spectral density estimation
(see Efromovich (1998)). Finally, a similar rate appears also, in a different context, in Rothman et al. (2007).

In classical linear time series modeling, one usually attempts to fit sequentially an AR(p) with increasing
orders of the maximal lag $p$ (or by fixing $p$ and then estimating the coefficients). The Lasso-type estimator of
scheme (3) will shrink irrelevant predictors down to zero. Thus, not only will model selection and estimation
occur simultaneously, but the fitted (selected) model will be chosen among all relevant AR(p) processes,
with $p = O(\log n)$.

3 Asymptotic Properties of the Lasso 

In this section we derive the asymptotic properties of the Lasso estimator $\hat\phi_n$. These include model selection
consistency, estimation consistency and prediction consistency. We briefly describe each type of consistency,
develop the needed notation, and present the results, with proofs relegated to Section 6.

3.1 Model Selection Consistency 

We assume that the AR(p) process (1) is generated according to a true, unknown parameter $\phi^* = (\phi_1^*, \ldots, \phi_p^*)$.
When $p$ is large, it is not unreasonable to believe that this vector is sparse, meaning that only a subset of
potential predictors are relevant. Model selection consistency is about recovering the sparsity structure of
the true, underlying parameter $\phi^*$.

For any vector $\phi \in \mathbb{R}^p$, let $\mathrm{sgn}(\phi) = (\mathrm{sgn}(\phi_1), \ldots, \mathrm{sgn}(\phi_p))$, where $\mathrm{sgn}(\phi_j)$ is the sign function taking
values $-1$, $0$ or $1$, according to whether $\phi_j < 0$, $\phi_j = 0$ or $\phi_j > 0$, respectively. A given estimator $\hat\phi_n$ is said to be
sign consistent if $\mathrm{sgn}(\hat\phi_n) = \mathrm{sgn}(\phi^*)$, with probability tending to one, as $n$ tends to infinity, i.e.,

$$\mathbb{P}(\mathrm{sgn}(\hat\phi_n) = \mathrm{sgn}(\phi^*)) \to 1, \qquad n \to \infty. \tag{4}$$

Let $S = \{j : \phi_j^* \neq 0\} = \mathrm{supp}(\phi^*) \subset \{1, 2, \ldots, p\}$. A weaker form of model selection consistency, implied by
the sign consistency, only requires that, with probability tending to 1, $\phi^*$ and $\hat\phi_n$ have the same support.
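The two notions of recovery just defined can be checked for a single fitted vector with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, exact zeros are assumed for excluded lags, and the probabilistic statement (4) of course concerns the behavior over repeated samples as $n$ grows, not a single comparison.

```python
def sgn(v):
    """Elementwise sign vector with values -1, 0 or 1."""
    return [(c > 0) - (c < 0) for c in v]


def support(v):
    """Set of 1-based lag indices with nonzero coefficient."""
    return {j + 1 for j, c in enumerate(v) if c != 0}


def same_signs(estimate, truth):
    """True when sgn(estimate) == sgn(truth), the event inside (4)."""
    return sgn(estimate) == sgn(truth)


def same_support(estimate, truth):
    """Weaker requirement: the estimate and the truth share the same support."""
    return support(estimate) == support(truth)
```

Equal signs imply equal supports, but not conversely: an estimate with the right nonzero lags and one wrong sign passes `same_support` and fails `same_signs`.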

We shall need a few more definitions. Let $s = |S|$ denote the cardinality of the set of true nonzero
coefficients, and let $\nu = p - s = |S^c|$, with $S^c = \{1, \ldots, p\} \setminus S$. For a set of indexes $I$, we will write
$x_I = \{x_i, i \in I\}$ to denote the subvector of $x$ whose elements are indexed by the coordinates in $I$. Similarly,
$x_I y_I$ is a vector with elements $x_i y_i$. For an $n \times p$ design matrix $\mathbf{X}$, we let $\mathbf{X}_I$, for any subset $I$ of $\{1, 2, \ldots, p\}$,
denote the sub-matrix of $\mathbf{X}$ with columns as indicated by $I$. Sub-matrices of the autocovariance matrix
$\Gamma_p$ (and of any other matrix) are denoted similarly. For example, $\Gamma_{I I^c}$ is $(\gamma(i - j))_{i \in I, j \notin I}$. Finally, let
$\alpha_n = \min_{j \in S} |\phi_j^*|$ denote the magnitude of the smallest nonzero coefficient. Although virtually all
quantities related to (3) depend on $n$, we do not always make this dependence explicit in our notation.
We are now ready to present our first result:

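Theorem 3.1. Consider the settings of the AR(p) process described above. Assume that

(i) there exists a finite, positive constant $C_{\max}$ such that $\|\Gamma_{SS}^{-1}\|_\infty \le C_{\max}$;

(ii) there exists an $\epsilon \in (0, 1]$ such that $\|\Gamma_{S^c S} \Gamma_{SS}^{-1}\|_\infty \le 1 - \epsilon$.

Further, assume that the following conditions hold:

$$\limsup_{n \to \infty} \frac{\max_{i \in S} \lambda_{n,i}}{\min_{j \in S^c} \lambda_{n,j}} \le 1, \tag{5}$$

$$\frac{1}{\alpha_n} \left[ \sqrt{s/n} + \lambda_n \|\lambda_{n,S}\|_\infty \right] \to 0, \qquad \text{as } n \to \infty, \tag{6}$$

$$\frac{n \lambda_n^2 (\min_{i \in S^c} \lambda_{n,i})^2}{\max\{s, \nu\}} \to \infty, \qquad \text{as } n \to \infty. \tag{7}$$

Let $p = O(\log n)$. Then, the Lasso estimator $\hat\phi_n$ is sign consistent (cf. (4)).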

Condition (ii) in Theorem 3.1 is assumed in various guises elsewhere in the Lasso literature (see, e.g.,
Wainwright (2006), Zhao and Bin (2006) and Zou (2006)). It is an incoherence condition, which controls the
amount of correlation between relevant variables and irrelevant variables. Condition (5) is intuitively clear
and it appears in a similar form in Nardi and Rinaldo (2008). It captures the rationale, recalling that
one may have $\lambda_{n,j} < \lambda_{n,k}$ for $j < k$, that (even) the largest penalty coefficient of the relevant lags should be
kept asymptotically smaller than the smallest penalty coefficient of the irrelevant lags. Conditions (6) and
(7) are similar to conditions appearing in Wainwright (2006), Nardi and Rinaldo (2008), and Lafferty et al.
(2007), to name but a few. The fraction $\sqrt{s/n}$ in (6) is in line with the similar works mentioned above. For
example, under the linear sparsity scheme, i.e., $s = \alpha p$, with $\alpha \in (0, 1)$ (see Wainwright (2006)), and with
$p$ comparable to $n$, the Gaussian ensemble leads to a fraction of order $O(\log n / n)$, which is similar to the
fraction under the current setting, for which we have $p = O(\log n)$.



3.2 Estimation and Prediction Consistency 

Our next result is about estimation consistency. An estimator $\hat\phi_n$ is said to be estimation consistent, or
$\ell_2$-consistent, if $\|\hat\phi_n - \phi^*\|$ converges to zero, as $n$ tends to infinity. We have the following:

Theorem 3.2. Recall the settings of the AR(p) process set forth below (1). Let $p = O(\log n)$, and
$a_n = p^{1/2} (n^{-1/2} + \lambda_n \|\lambda_{n,S}\|)$. Assume that $\lambda_n \|\lambda_{n,S}\| = O(n^{-1/2})$. Then, the Lasso estimator $\hat\phi_n$ is estimation
consistent with a rate of order $O(a_n)$.

Prediction consistency is about a similar convergence statement, but for the prediction of future values
using the fitted model. Formally, prediction consistency holds if $\|\mathbf{X}\hat\phi_n - \mathbf{X}\phi^*\|$ converges to zero, as $n$ tends
to infinity. We show below a similar result when the sample autocovariance matrix $\mathbf{X}'\mathbf{X}$ is replaced by the
(theoretical) autocovariance matrix $\Gamma_p$. The autoregressive settings assumed here are, in some sense, much
more challenging than in linear (parametric or non-parametric) regression models, for two reasons. Firstly,
the design matrix is not fixed as is usually assumed, and secondly, the entries of $\mathbf{X}$ are not independent
across rows, as is usually assumed for random designs.

The family of AR processes considered here is, in fact, a subset of a larger family of time series. In
order to establish the prediction consistency result, we make explicit use of the structure of this larger
family, which we now describe.
Following Goldenshluger and Zeevi (2001), we denote by $\mathcal{H}_\rho(l, L)$, for some $\rho > 1$, $0 < l < 1$, and $L > 1$,
a family consisting of all stationary Gaussian time series with $\mathbb{E} X_t = 0$, $\mathbb{E}|X_t|^2 = 1$, and with

$$0 < l \le |\psi(z)| \le L,$$

for every complex $z$ with $|z| \le \rho$, where $\psi(z)$ is the MA($\infty$) transfer function related to the AR polynomial
by $\psi(z) = 1/\phi(z)$.

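We shall need the notion of a strong mixing (or $\alpha$-mixing) condition. Let $\{X_t\}$ be a time series defined
on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. For any two (sub) $\sigma$-fields $\mathcal{A}$ and $\mathcal{B}$, define

$$\alpha(\mathcal{A}, \mathcal{B}) = \sup_{A \in \mathcal{A}, B \in \mathcal{B}} |\mathbb{P}(A \cap B) - \mathbb{P}(A)\mathbb{P}(B)|.$$

Denote by $\mathcal{F}_s^t$ the $\sigma$-field generated by $(X_s, \ldots, X_t)$, for $-\infty \le s \le t \le \infty$. Then, $\{X_t\}$ is said to be strongly
mixing if $\alpha_X(m) \to 0$, as $m \to \infty$, where

$$\alpha_X(m) = \sup_{j \in \{0, \pm 1, \pm 2, \ldots\}} \alpha(\mathcal{F}_{-\infty}^{j}, \mathcal{F}_{j+m}^{\infty}).$$

The attractiveness of $\mathcal{H}_\rho(l, L)$ comes from the fact that processes in $\mathcal{H}_\rho(l, L)$ are strongly mixing with an
exponential decay, i.e.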

M-)<2(^)V-. (8) 

This follows since processes in $\mathcal{H}_\rho(l, L)$ have exponentially decaying AR coefficients as well as exponentially
decaying autocovariances (see Goldenshluger and Zeevi (2001), Lemma 1, and in particular, expression (39)).

For every $p$-dimensional vector $a$ and $p \times p$ symmetric matrix $A$, we denote by $\|a\|_A^2 = a'Aa$ the
(squared) $\ell_2$-norm associated with $A$. Let $C_1$, $C_2$ be two universal constants (their explicit values are given
within the proof of the following theorem). Define

/3 1 = 1+J- fa =X+r J* t an d D = (C?C 2 /3 1 2 ^) 1 / 5 . (9) 
logp 

Let $\lambda_{\min} = \min_{j=1,\ldots,p} \lambda_{n,j}$, and $\lambda_{\max} = \max_{j=1,\ldots,p} \lambda_{n,j}$. We have:

Theorem 3.3. Recall the settings of the AR(p) process set forth below (1). Let $p = O(\log n)$. Assume:

(i) There exists a finite, positive constant $M$ such that $\lambda_{n,j} \le M$, for every $j = 1, \ldots, p$.

(ii) For every $p \ge 2$, there exists a positive constant $\kappa_p$, such that

$$\Gamma_p - \kappa_p \, \mathrm{diag}(\Gamma_p)$$

is a positive semi-definite matrix.

If $\lambda_n (s/p)^{1/2} \le D n^{-2/5}$, then there exist a constant $C$ (depending only on $M$), and constants $F_1$ and $F_2$
(depending only on $C_1$, $C_2$, $\beta_1$, $\beta_2$), such that for all $0 < c < \infty$, and all $y > \sigma^2 (n + D n^{3/5})$,

\\k-^\\ 2 r p <CX 2 n - 

Kp 

holds true with probability at least $1 - \pi_n$, where

7T„ < 6pcxp j -Fx min ^y - n) 1 ' 3 , c 2 a~ 2 , y fJ^^ /2 } } + P 2 exp {~F 2 n\ 2 n (s/p 2 )} . (10) 
Condition (ii) has been used in the context of aggregation procedures for nonparametric regression with
fixed design (Bunea et al. (2007a)), and also for nonparametric regression with random design (Bunea et al.
(2007b)).

Theorem 3.3 may be utilized to show that the Lasso estimator $\hat\phi_n$ is prediction consistent. One only
needs to make sure that the bound (10) on $\pi_n$ decays to zero. The theorem actually gives a whole range of
possible rates of decay, obtained by picking $c$ and $y$. One possible choice is given below.

Corollary 3.4. Let $\lambda_n = n^{-a}$, with $a \in (2/5, 1/2)$. Let $c = D_1 y / (n \lambda_n \lambda_{\max})$, and $y = D_2 n$, for positive
constants $D_1$, $D_2$. If $(s/p)^{1/2} \le D n^{a - 2/5}$, then there exists an appropriate constant $F$, such that the bound
(10) on $\pi_n$ is smaller than

$$p^2 \exp\left\{ -F \min\left\{ n^{1/3}, \; n^{2a}/\lambda_{\max}^2, \; n^{1-2a} \lambda_{\min}^2, \; n^{1-2a} s/p^2 \right\} \right\},$$

which tends to zero as $n$ goes to infinity.

4 Illustrative Simulations 

We consider a sparse autoregressive time series of length 1000 obeying the model 

$$X_t = 0.2 X_{t-1} + 0.1 X_{t-3} + 0.2 X_{t-5} + 0.3 X_{t-10} + 0.1 X_{t-15} + Z_t, \tag{11}$$

with nonzero coefficients at lags 1, 3, 5, 10 and 15, where the innovations Z t are i.i.d. Gaussians with mean 
zero and standard deviation 0.1. The coefficients were chosen to satisfy the characteristic equation for a 
stationary AR process. 
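For readers who want to reproduce a series of this kind, the following Python sketch simulates a path from model (11) using only the standard library. The function name `simulate_ar`, the burn-in length, the zero initial values, and the seed are assumptions of this example; the paper does not state how its series were initialized.

```python
import random


def simulate_ar(phi, sigma, n, burn_in=500, seed=0):
    """Simulate n values of X_t = sum_j phi[j-1] * X_{t-j} + Z_t, Z_t ~ N(0, sigma^2).

    The recursion starts from zeros and discards the first burn_in values so that
    the retained path is approximately stationary.
    """
    rng = random.Random(seed)
    p = len(phi)
    x = [0.0] * p
    for _ in range(n + burn_in):
        past = x[-p:]  # past[-1] is X_{t-1}, past[-2] is X_{t-2}, ...
        mean = sum(phi[j] * past[-1 - j] for j in range(p))
        x.append(mean + rng.gauss(0.0, sigma))
    return x[-n:]


# Coefficients of model (11): nonzero at lags 1, 3, 5, 10 and 15.
PHI_11 = [0.0] * 15
for lag, value in [(1, 0.2), (3, 0.1), (5, 0.2), (10, 0.3), (15, 0.1)]:
    PHI_11[lag - 1] = value

series = simulate_ar(PHI_11, sigma=0.1, n=1000)
```

The absolute values of the coefficients sum to 0.9, which is less than one, so the AR polynomial has no roots in the closed unit disc and the simulated process is causal.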

Figure 1 shows one time series simulated according to the model (11), along with its autocorrelation and
partial autocorrelation plots. For this time series, Figure 2 shows the solution paths computed using the R
algorithm lars with p = 50. Notice that we only use one penalty parameter, i.e. we penalize
equally all the autoregressive coefficients. The vertical line marks the optimal $\ell_1$ threshold found by cross
validation. In our simulations, we declared significant the variables whose coefficients have nonzero solution
paths meeting the vertical line corresponding to the cross validation value.

Notice that, in the exemplary instance displayed in Figure 2, all the nonzero autoregressive coefficients
are correctly included in the model. Furthermore, a more careful inspection of the solution paths reveals that
the order in which the significant variables enter the set of active solutions matches very closely the magnitude
of the coefficients used in our model, with $\phi_{10}$ and $\phi_5$, the more significant coefficients, entering almost
immediately, and $\phi_3$ and $\phi_{15}$ entering last. In contrast, Figure 3 displays the fitted values for the first 30
autoregressive coefficients computed using the Yule-Walker method as implemented in R by the routine ar
(note that the Yule-Walker estimator has the same asymptotic distribution as the MLE). Notice that the
solution is non-sparse. The dashed vertical gray lines indicate the true nonzero coefficients. The autoregressive
order of the model was correctly estimated to be 15 using the AIC criterion.
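For comparison with the Lasso fit, a Yule-Walker estimate can be sketched in a few lines of Python using sample autocovariances and the Levinson-Durbin recursion. This is a generic textbook sketch under the stated assumptions (biased sample autocovariances with divisor $n$, a fixed order `p` supplied by the user); it is not the implementation inside the R routine ar, and the function names are hypothetical.

```python
def sample_autocov(x, max_lag):
    """Biased sample autocovariances gamma_hat(0..max_lag), with divisor n."""
    n = len(x)
    mean = sum(x) / n
    return [sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(gamma, p):
    """Solve the Yule-Walker equations for an AR(p) fit.

    Returns the coefficient list (phi_1, ..., phi_p) and the innovation variance.
    """
    phi = []
    v = gamma[0]
    for k in range(1, p + 1):
        acc = gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))
        kappa = acc / v  # partial autocorrelation at lag k
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        v *= 1.0 - kappa ** 2
    return phi, v
```

Unlike the Lasso solution, the coefficients returned here are generically all nonzero, which is the non-sparsity visible in Figure 3.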

We simulated 1000 time series from the model (11) and we selected the significant variables according
to the cross-validation rule described above. Figure 4 a) displays the histogram of the number of selected
variables. The mean and standard deviation of these numbers are 6.42 and 2.44, respectively, while the
minimum, median and maximum numbers are 3, 6 and 22, respectively. In comparison, Figure 4 b) shows the
histogram of the autoregressive orders determined by AIC in ar. Table 1 displays some summary statistics
of our simulations. In particular, the first row shows the number of times, out of 1000 simulated time
series, that each of the nonzero autoregressive coefficients was correctly selected. The second row indicates
the number of times the variable corresponding to each nonzero coefficient in (11) was among the first five
selected variables. Notice that $\phi_{10}$ and $\phi_5$ are always included among the selected variables, while $\phi_3$ and
$\phi_{15}$ have a significantly smaller, but nonetheless quite high, chance of being selected.

We also investigated the order in which the autoregressive coefficients entered the solution paths, the
rationale being that more significant nonzero variables enter sooner, in accordance with the way the lars
algorithm works (see Efron et al. (2004)). Figure 5 summarizes our findings. In each of the barplots, the x-axis
indexes the steps at which the variable corresponding to the autoregressive coefficient enters the solution
path, while the y-axis displays the frequency. Interestingly enough, in most cases, $\phi_{10}$ and $\phi_5$ are selected as
the first and second nonzero variables, while $\phi_{15}$ and, in particular, $\phi_3$ enter the set of active variables later
and are not even among the first five variables selected in 1.9% and 20.2% of the cases, respectively.

Figure 1: A time series simulated from the sparse autoregressive model (11) along with its autocorrelation
and partial autocorrelation coefficients.







Figure 2: Solution paths of the lars algorithm when applied to the time series displayed in Figure 1. The
vertical bar represents the optimal $\ell_1$ penalty for this time series selected using cross validation.

Table 1: Number of times the nonzero autoregressive coefficients are correctly identified and number of times
they are correctly selected among the first 5 variables entering the solution paths.

Figure 3: Autoregressive coefficients for the time series of Figure 1 obtained using the routine ar. The
dashed vertical lines mark the lags of the true nonzero coefficients.

5 Discussion 

We defined the Lasso procedure for fitting an autoregressive model, where the maximal lag may increase with
the sample size. Under this double asymptotic framework, the Lasso estimator was shown to possess several
consistency properties. In particular, when $p = O(\log n)$, the Lasso estimator is model selection consistent,
estimation consistent, and prediction consistent. The advantage of using the Lasso procedure in conjunction
with a growing $p$ is that the fitted model will be chosen among all possible AR models whose maximal lag
is between 1 and $O(\log n)$. Letting $n$ go to infinity, we may virtually obtain a good approximation for a
general linear time series.

As mentioned in Section 2, the assumption about Gaussianity of the underlying noise $\{Z_t\}$ is not
necessary. The proof of the model selection consistency result (Theorem 3.1) avoids making use of Gaussianity by
using Burkholder's inequality in conjunction with a maximal moment inequality. The proof of the
estimation consistency result (Theorem 3.2) requires Lemma 6.2, which does make use of the assumed Gaussianity.
However, this is not crucial. In fact, we can relax the Gaussianity assumption and require only that the $Z_t$ are
IID$(0, \sigma^2)$, with bounded fourth moment (see Davis and Brockwell (1991), p. 226-227). In this case, instead
of using Wick's formula we may apply the moving average representation $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$, along with
the absolute summability of the $\psi_j$'s. Finally, the prediction consistency result (Theorem 3.3 and Corollary
3.4) may also be obtained by relaxing the Gaussianity assumption. One only needs to impose appropriate
moment conditions on the driving noise.

Figure 4: Distribution for the number of variables selected by a) the lars algorithm using cross validation
and b) the ar algorithm using AIC over 1000 simulations of the time series described in (11).

Autoregressive modeling via the Lasso procedure suggests other interesting future directions. In
many cases, non-linearity is evident from the data. In order to capture deviation from linearity, one may try
to fit a non-linear (autoregressive) time series model to the data in the form

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + \sum_{\nu=2}^{p} \Big\{ \phi^{i_1 \cdots i_\nu} \prod_{j=1}^{\nu} X_{t-i_j} \Big\} + Z_t,$$

where we used the Einstein notation for the term in the curly brackets, to indicate summation over all
$i_1 < i_2 < \cdots < i_\nu$. Notice that even for moderate values of $p$, the number of possible interaction terms may be
very large. This is a very challenging problem as one needs to obtain a solid understanding of the properties
of the non-linear autoregressive process before applying the Lasso (or any other) procedure.



6 Proofs 

Here we prove Theorems 3.1, 3.2, and 3.3. Recall scheme (3). This is a convex minimization
problem. Denote by $M_{\Lambda_n}(\cdot)$, for $\Lambda_n = \{\lambda_n, \{\lambda_{n,j}, j = 1, \ldots, p\}\}$, the objective function, i.e.,

$$M_{\Lambda_n}(\phi) = \frac{1}{2n} \|y - \mathbf{X}\phi\|^2 + \lambda_n \sum_{j=1}^{p} \lambda_{n,j} |\phi_j|. \tag{12}$$

The Lasso estimator is an optimal solution to the problem $\min\{M_{\Lambda_n}(\phi), \phi \in \mathbb{R}^p\}$. The gradient and Hessian of
the least-squares part in $M_{\Lambda_n}(\cdot)$ are given, respectively, by $n^{-1} \mathcal{X}\phi - n^{-1} \sum_{t=1}^{n} X_t \mathbf{X}_t$, and $n^{-1} \mathcal{X}$, where $\mathcal{X}$
(the Gram matrix associated with the design matrix $\mathbf{X}$) and $\mathbf{X}_t$ are notation that we use throughout this
section:

$$\mathcal{X} = \mathbf{X}'\mathbf{X}, \qquad \mathbf{X}_t = (X_{t-1}, \ldots, X_{t-p}).$$

Figure 5: Frequencies of the order at which the 5 autoregressive coefficients entered the solution paths for
the lars algorithm over 1000 simulations of the time series described in (11).

6.1 Model Selection Consistency 

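Proof of Theorem 3.1. We adapt a Gaussian ensemble argument, given in Wainwright (2006), to the present
setting. Standard optimality conditions for convex optimization problems imply that $\hat\phi_n \in \mathbb{R}^p$ is an optimal
solution to the problem $\min\{M_{\Lambda_n}(\phi), \phi \in \mathbb{R}^p\}$, if, and only if,

$$\frac{1}{n} \mathcal{X} \hat\phi_n - \frac{1}{n} \sum_{t=1}^{n} X_t \mathbf{X}_t + \lambda_n \xi_n = 0, \tag{13}$$

where $\xi_n \in \mathbb{R}^p$ is a sub-gradient vector with elements $\xi_{n,j} = \mathrm{sgn}(\hat\phi_{n,j}) \lambda_{n,j}$ if $\hat\phi_{n,j} \neq 0$, and $|\xi_{n,j}| \le \lambda_{n,j}$
otherwise. Here $\mathcal{X} = \mathbf{X}'\mathbf{X}$ is the Gram matrix of the design matrix $\mathbf{X}$ and $\mathbf{X}_t = (X_{t-1}, \ldots, X_{t-p})$.
Plugging the model structure, $y = \mathbf{X}\phi^* + Z$, into (13), one can see that the optimality conditions
become

$$\frac{1}{n} \mathcal{X} (\hat\phi_n - \phi^*) - \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t + \lambda_n \xi_n = 0. \tag{14}$$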
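The optimality conditions (13) are easy to verify numerically for any candidate solution, which is a useful sanity check on a Lasso solver. The following Python sketch measures the largest violation of these conditions; the function name `kkt_violation` and the list-of-rows representation of the design matrix are assumptions of this example.

```python
def kkt_violation(X, y, phi, lam, weights):
    """Largest violation of the sub-gradient conditions for
    (1/2n)||y - X phi||^2 + lam * sum_j weights[j] * |phi_j|.

    For phi_j != 0 the gradient of the least-squares part must equal
    -lam * weights[j] * sgn(phi_j); for phi_j == 0 it must lie in
    [-lam * weights[j], lam * weights[j]].
    """
    n, p = len(y), len(phi)
    resid = [y[t] - sum(X[t][j] * phi[j] for j in range(p)) for t in range(n)]
    worst = 0.0
    for j in range(p):
        grad = -sum(X[t][j] * resid[t] for t in range(n)) / n
        if phi[j] != 0.0:
            sign = 1.0 if phi[j] > 0 else -1.0
            violation = abs(grad + lam * weights[j] * sign)
        else:
            violation = max(0.0, abs(grad) - lam * weights[j])
        worst = max(worst, violation)
    return worst
```

A returned value of zero (up to numerical tolerance) indicates that the candidate satisfies the conditions and is therefore a minimizer of the convex objective.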



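Recall the sparsity set, $S = \{j : \phi_j^* \neq 0\} = \mathrm{supp}(\phi^*)$, the sparsity cardinality $s = |S|$, and $\nu = p - s =
|S^c|$. Decomposing the design matrix $\mathbf{X}$ into relevant and non-relevant variables, $\mathbf{X} = (\mathbf{X}_S, \mathbf{X}_{S^c})$, we may
write the Gram matrix $\mathcal{X} = \mathbf{X}'\mathbf{X}$ as a block matrix of the form

$$\mathcal{X} = \begin{pmatrix} \mathcal{X}_{SS} & \mathcal{X}_{S S^c} \\ \mathcal{X}_{S^c S} & \mathcal{X}_{S^c S^c} \end{pmatrix} = \begin{pmatrix} \mathbf{X}_S' \mathbf{X}_S & \mathbf{X}_S' \mathbf{X}_{S^c} \\ \mathbf{X}_{S^c}' \mathbf{X}_S & \mathbf{X}_{S^c}' \mathbf{X}_{S^c} \end{pmatrix}.$$

Notice, for example, that $\mathcal{X}_{SS} = (\sum_{t=1}^{n} X_{t-i} X_{t-j})_{i,j \in S}$. Incorporating this into the optimality conditions
(14) we obtain the following two relations,

$$\frac{1}{n} \mathcal{X}_{SS} [\hat\phi_{n,S} - \phi_S^*] - \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S} = -\lambda_n \lambda_{n,S} \, \mathrm{sgn}(\phi_S^*),$$

$$\frac{1}{n} \mathcal{X}_{S^c S} [\hat\phi_{n,S} - \phi_S^*] - \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S^c} = -\lambda_n \xi_{n,S^c},$$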



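where $\mathbf{X}_t^{S}$ and $\mathbf{X}_t^{S^c}$ are vectors with elements $\{X_{t-i}, i \in S\}$ and $\{X_{t-i}, i \in S^c\}$, respectively. If $n - s > s$
then $\mathcal{X}_{SS}$ is non-singular with probability one, and we can solve for $\hat\phi_{n,S}$ and $\xi_{n,S^c}$,

$$\hat\phi_{n,S} = \phi_S^* + \Big( \frac{1}{n} \mathcal{X}_{SS} \Big)^{-1} \Big[ \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S} - \lambda_n \lambda_{n,S} \, \mathrm{sgn}(\phi_S^*) \Big],$$

$$\lambda_n \xi_{n,S^c} = -\mathcal{X}_{S^c S} \mathcal{X}_{SS}^{-1} \Big[ \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S} - \lambda_n \lambda_{n,S} \, \mathrm{sgn}(\phi_S^*) \Big] + \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S^c}.$$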



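Now, sign consistency is equivalent (see Lemma 1 in Wainwright (2006)) to showing that

$$\Big| \Big( \frac{1}{n} \mathcal{X}_{SS} \Big)^{-1} \Big[ \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S} - \lambda_n \lambda_{n,S} \, \mathrm{sgn}(\phi_S^*) \Big] \Big| < |\phi_S^*|, \tag{15}$$

$$\Big| \mathcal{X}_{S^c S} \mathcal{X}_{SS}^{-1} \Big[ \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S} - \lambda_n \lambda_{n,S} \, \mathrm{sgn}(\phi_S^*) \Big] - \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S^c} \Big| \le \lambda_n \lambda_{n,S^c} \tag{16}$$

hold, elementwise, with probability tending to 1. Denote the events in (15) and in (16) by $\mathcal{A}$ and $\mathcal{B}$,
respectively. The rest of the proof is devoted to showing that $\mathbb{P}(\mathcal{A}) \to 1$ and $\mathbb{P}(\mathcal{B}) \to 1$, as $n \to \infty$.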

We commence with $\mathcal{A}$. Let $\alpha_n = \min_{j \in S} |\phi_j^*|$. Recall the notation $\|x_I\|_\infty$ for the norm on a set of
indices $I$, i.e., $\max_{i \in I} |x_i|$ (and similarly for matrices). It is enough to show that $\mathbb{P}(\|A_S\|_\infty > \alpha_n) \to 0$, as $n$
tends to infinity, where

$$A_S = \Big( \frac{1}{n} \mathcal{X}_{SS} \Big)^{-1} \Big[ \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S} - \lambda_n \lambda_{n,S} \, \mathrm{sgn}(\phi_S^*) \Big]. \tag{17}$$

Confine attention to the matrix $\mathcal{X}_{SS}$. The entry at row $i \in S$ and column $j \in S$ is given by $\sum_{t=1}^{n} X_{t-i} X_{t-j}$.
Notice that, equivalently, we can write this as $\sum_{t=1-i}^{n-i} X_t X_{t+i-j}$. Following Davis and Brockwell (1991), one
can show that $n^{-1} \mathcal{X}_{SS} \to \Gamma_{SS}$ in probability, as $n \to \infty$, where $\Gamma_{SS} = (\gamma(i - j))_{i,j \in S}$ and $\gamma(\cdot)$ is the
autocovariance function, $\gamma(h) = \mathbb{E} X_t X_{t+h}$. Therefore, by assumption (i) in Theorem 3.1, there exists a finite
constant $C_{\max}$, such that $\|(n^{-1} \mathcal{X}_{SS})^{-1}\|_\infty \le o_P(1) + C_{\max}$. We continue by investigating the probability
associated with the term inside the square brackets in (17).
Notice that $\|\sum_{t=1}^{n} Z_t \mathbf{X}_t^{S}\|_\infty$ is given by $\max_{i \in S} |\sum_{t=1}^{n} Z_t X_{t-i}|$, where $Z_t$ and $X_{t-i}$ are independent
random variables for each $t = 1, \ldots, n$, and $i \in S$. Fix an $i \in S$, and define

$$T_n = T_{n,i} = \sum_{t=1}^{n} Z_t X_{t-i}. \tag{18}$$

Let $\mathcal{F}_n = \sigma(\ldots, Z_{n-1}, Z_n)$ be the sigma-field generated by $\{\ldots, Z_{n-1}, Z_n\}$. A simple calculation shows that
$\{T_n, \mathcal{F}_n\}_n$ is a martingale. Finally, let $Y_n = T_n - T_{n-1}$ denote the martingale difference sequence associated
with $T_n$. We quote below a result concerning martingale moment inequalities, which we shall make use of.
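As a numerical illustration of the martingale $T_n$ in (18), the following Python sketch simulates it for a unit-variance AR(1) process with lag $i = 1$ and estimates $\mathbb{E}|T_n|$ and $\mathbb{E} T_n^2$ by Monte Carlo. Because the increments $Z_t X_{t-1}$ are uncorrelated with second moment $\sigma^2$, one expects $\mathbb{E} T_n^2 = n\sigma^2$ and hence $\mathbb{E}|T_n| \le \sigma\sqrt{n}$. The AR(1) choice, the function names, and the replication counts are assumptions of this example only.

```python
import math
import random


def simulate_T(phi1, n, rng):
    """One draw of T_n = sum_t Z_t * X_{t-1} for a stationary, unit-variance AR(1)."""
    sigma = math.sqrt(1.0 - phi1 ** 2)  # makes Var(X_t) = sigma^2 / (1 - phi1^2) = 1
    x_prev = rng.gauss(0.0, 1.0)        # start from the stationary distribution
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, sigma)
        total += z * x_prev             # Z_t is independent of X_{t-1}
        x_prev = phi1 * x_prev + z
    return total


def monte_carlo_moments(phi1, n, reps, seed=0):
    """Monte Carlo estimates of E|T_n| and E T_n^2, and the reference value sigma * sqrt(n)."""
    rng = random.Random(seed)
    sigma = math.sqrt(1.0 - phi1 ** 2)
    draws = [simulate_T(phi1, n, rng) for _ in range(reps)]
    mean_abs = sum(abs(t) for t in draws) / reps
    mean_sq = sum(t * t for t in draws) / reps
    return mean_abs, mean_sq, sigma * math.sqrt(n)
```

Calling `monte_carlo_moments(0.5, 200, 400)` returns the two estimates together with $\sigma\sqrt{n}$, so the first can be compared with the reference value and the second with its square.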

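Theorem 6.1 (Burkholder's Inequality). Let $\{X_n, \mathcal{F}_n\}_{n=1}^{\infty}$ be a martingale, and let $\tilde{X}_n = X_n - X_{n-1}$ be the
associated martingale difference sequence. Let $q > 1$. Then there exist finite and positive constants $c = c(q)$ and
$C = C(q)$ (depending only on $q$), such that

$$c \left[ \mathbb{E} \Big( \sum_{i=1}^{n} \tilde{X}_i^2 \Big)^{q/2} \right]^{1/q} \le \left[ \mathbb{E} |X_n|^q \right]^{1/q} \le C \left[ \mathbb{E} \Big( \sum_{i=1}^{n} \tilde{X}_i^2 \Big)^{q/2} \right]^{1/q}.$$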



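Applying the Cauchy-Schwarz inequality followed by Burkholder's inequality, we obtain

$$\mathbb{E}|T_n| \le \left[ \mathbb{E} \Big| \sum_{t=1}^{n} Z_t X_{t-i} \Big|^2 \right]^{1/2} \tag{19}$$

$$\le C \left[ \sum_{t=1}^{n} \mathbb{E} |Z_t^2 X_{t-i}^2| \right]^{1/2} \le C \sigma \sqrt{n}, \tag{20}$$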


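where $C$ is a finite and positive constant (from Burkholder's inequality). The last inequality follows by the
independence between $Z_t$ and $X_{t-i}$, and since $\mathbb{E}|X_{t-i}|^2 = 1$. Fix an arbitrary, positive $\xi < \infty$. By a trivial
bound we get

$$\mathbb{E} \max_{i \in S} |T_{n,i}| \le \xi + \sum_{i \in S} \int_{\xi}^{\infty} \mathbb{P}[|T_{n,i}| > y] \, dy \le \xi + s \, C^2 \sigma^2 n \, \frac{1}{\xi},$$

recalling (20); the second inequality uses Chebyshev's inequality with the second moment bound
$\mathbb{E} T_{n,i}^2 \le C^2 \sigma^2 n$. Now, picking $\xi = \sqrt{sn}$, which is optimal, in the sense of obtaining an (asymptotically) smallest
fraction, we have,

$$\frac{1}{n} \mathbb{E} \max_{i \in S} |T_{n,i}| \le \sqrt{s/n} + C^2 \sigma^2 \sqrt{s/n} = O(\sqrt{s/n}). \tag{21}$$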

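This, in turn, implies, utilizing (17) and Markov's inequality, that $\mathbb{P}(\mathcal{A}) \to 1$, by imposing the condition:

$$\frac{1}{\alpha_n} \left[ \sqrt{s/n} + \lambda_n \|\lambda_{n,S}\|_\infty \right] \to 0, \qquad \text{as } n \to \infty,$$

which is condition (6).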

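We turn to the event $\mathcal{B}$. Repeating the argument below (17), it is enough to show a similar assertion about
the event $\mathcal{B}$, with the modification of replacing $\mathcal{X}_{S^c S} \mathcal{X}_{SS}^{-1}$ by $\Gamma_{S^c S} \Gamma_{SS}^{-1}$. A sufficient condition for this to hold
is that $\{\|B_{S^c}\|_\infty \le \lambda_n \min_{i \in S^c} \lambda_{n,i}\}$ happens with probability tending to one, where

$$B_{S^c} = \Gamma_{S^c S} \Gamma_{SS}^{-1} \Big[ \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S} - \lambda_n \lambda_{n,S} \, \mathrm{sgn}(\phi_S^*) \Big] - \frac{1}{n} \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S^c}. \tag{22}$$

Under the incoherence condition (condition (ii) in the statement of the theorem), we have the following
upper bound:

$$\|B_{S^c}\|_\infty \le (1 - \epsilon) \frac{1}{n} \Big\| \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S} \Big\|_\infty + (1 - \epsilon) \lambda_n \|\lambda_{n,S}\|_\infty + \frac{1}{n} \Big\| \sum_{t=1}^{n} Z_t \mathbf{X}_t^{S^c} \Big\|_\infty,$$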
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which leads to: F(||S S '= ||oo > A 

2(1 ' £) HE^tXflU > 6) +P(^— 2 ^||f>Xf Hoc > 6) , (23) 

with b = 1 — (1 — e)||A„ i 5|| 00 /mini e 5c A ni j. Note that inequality (23) follows by the inclusion {U + V > z} C 
{?/ > z/2} U {V > z/2}. Under condition (5), it would be enough to consider the right hand side of (23), 
replacing (the two instances of) b by e. For the first term in (23) we have 

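$$P\Big( \frac{2(1-\epsilon)}{n\lambda_n \min_{i \in S^c}\lambda_{n,i}} \Big\|\sum_{t=1}^{n} Z_t \mathbf{X}_t^{S}\Big\|_\infty > \epsilon \Big) \le \frac{1-\epsilon}{\epsilon}\,\frac{2}{\lambda_n \min_{i \in S^c}\lambda_{n,i}}\,\frac{1}{n} E\max_{i \in S}|T_{n,i}|, \qquad (24)$$

which tends by (21) to zero once

$$\frac{n\lambda_n^2(\min_{i \in S^c}\lambda_{n,i})^2}{s} \to \infty, \quad \text{as } n \to \infty. \qquad (25)$$

The same argument may be adapted for $\max_{j \in S^c}|T_{n,j}|$. We only need to replace $S$ by $S^c$. In this case we find that the condition

$$\frac{n\lambda_n^2(\min_{i \in S^c}\lambda_{n,i})^2}{p-s} \to \infty, \quad \text{as } n \to \infty, \qquad (26)$$

is sufficient for showing that the second term in (23) converges to zero. Condition (7) in the statement of the theorem guarantees both (25) and (26). The proof is now complete.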

□ 

6.2 Estimation and Prediction Consistency 

Proof of Theorem 3.2. We follow Fan and Peng (2004). In particular, denoting $a_n = p^{1/2}(n^{-1/2} + \lambda_n\|\lambda_{n,S}\|)$, we will show that for every $\epsilon > 0$ there exists a constant $C$, large enough, such that

$$P\Big[ \inf_{\|u\| = C} M_{\lambda_n}(\phi^* + a_n u) > M_{\lambda_n}(\phi^*) \Big] \ge 1 - \epsilon,$$

where $M_{\lambda_n}(\cdot)$ is the objective function and is given in (12). This implies that $\|\hat{\phi}_n - \phi^*\| = O_P(a_n)$.

Multiplying both sides by $n$ clearly does not change the probability. We will show that $-n(M_{\lambda_n}(\phi^* + a_n u) - M_{\lambda_n}(\phi^*)) < 0$ holds uniformly over $\|u\| = C$. Write

$$M_{\lambda_n}(\phi) = h(\phi) + \lambda_n \sum_{i=1}^{p} \lambda_{n,i}|\phi_i|,$$

for $h(\phi) = \|y - X\phi\|^2/2n$. We have

$$-n\big(M_{\lambda_n}(\phi^* + a_n u) - M_{\lambda_n}(\phi^*)\big) \le -n\big[h(\phi^* + a_n u) - h(\phi^*)\big] - n\lambda_n \sum_{j \in S} \lambda_{n,j}\big[|\phi_j^* + a_n u_j| - |\phi_j^*|\big].$$

Consider separately the least squares term, and the term associated with the $\ell_1$-penalty. We have, exploiting the fact that $\sum_{t=1}^{n} X_t \mathbf{X}_t = \mathbb{X}\phi^* + \sum_{t=1}^{n} Z_t \mathbf{X}_t$,

$$-n\big[h(\phi^* + a_n u) - h(\phi^*)\big] = a_n u' \sum_{t=1}^{n} Z_t \mathbf{X}_t - a_n^2 u'\mathbb{X}u/2 = I_1 - I_2.$$
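The decomposition of the least squares term into $I_1 - I_2$ is a purely algebraic identity, so it can be checked on a toy design. The following is a minimal sketch in plain Python; the matrix `X`, the vectors `phi_star`, `Z`, `u`, the scalar `a_n` and the helper names are all made-up illustrative values, not data from the paper.

```python
def matvec(X, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(x * vi for x, vi in zip(row, v)) for row in X]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def h(X, y, phi):
    """Least squares part of the objective: ||y - X phi||^2 / (2n)."""
    resid = [yi - fi for yi, fi in zip(y, matvec(X, phi))]
    return dot(resid, resid) / (2 * len(y))

# Hypothetical toy data: y = X phi* + Z.
X = [[1.0, 0.5], [-0.3, 1.0], [0.8, -0.3], [0.2, 0.8]]
phi_star = [0.6, -0.2]
Z = [0.1, -0.4, 0.25, 0.05]
y = [f + z for f, z in zip(matvec(X, phi_star), Z)]

a_n, u = 0.3, [1.0, -2.0]
n = len(y)
shifted = [p + a_n * ui for p, ui in zip(phi_star, u)]

lhs = -n * (h(X, y, shifted) - h(X, y, phi_star))
Xu = matvec(X, u)
I1 = a_n * dot(Xu, Z)            # a_n u' X'Z
I2 = a_n ** 2 * dot(Xu, Xu) / 2  # a_n^2 u' X'X u / 2

print(lhs, I1 - I2)
```

Here `X'X` plays the role of the Gram matrix appearing in $I_2$, and `X'Z` the role of $\sum_t Z_t \mathbf{X}_t$ in $I_1$.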

Recalling the definition of $T_{n,i} = \sum_{t=1}^{n} Z_t X_{t-i}$ (see (18)), and utilizing the result in (20) we obtain

$$|I_1| \le a_n \|u\| \cdots$$

Moving on to $I_2$, we write

$$I_2 = a_n^2 u'\mathbb{X}u/2 = n a_n^2 u'(n^{-1}\mathbb{X} - \Gamma_p)u/2 + n a_n^2 u'\Gamma_p u/2. \qquad (27)$$

We know that $n^{-1}\mathbb{X}_{ij}$ tends in probability to $\gamma(i-j)$, where $\mathbb{X}_{ij} = \sum_{t=1}^{n} X_{t-i}X_{t-j}$ is the $(i,j)$ entry of $\mathbb{X}$. This clearly implies $\|n^{-1}\mathbb{X} - \Gamma_p\| = o_P(1)$, in the fixed $p$ scenario. Lemma 6.2 below shows that this may also hold true in the growing $p$ scenario which we consider here.

Lemma 6.2. Assume $\sum_{j=0}^{\infty} |\psi_j| < \infty$, as before. Then,

$$\|n^{-1}\mathbb{X} - \Gamma_p\| = o_P(1). \qquad (28)$$

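Proof. We adopt arguments given in Davis and Brockwell (1991, p. 226-227). Let $\epsilon > 0$ be given. Using the fact that $\|A\| \le \|A\|_F$, where $\|\cdot\|_F$ is the Frobenius matrix norm, $\|A\|_F^2 = \sum_{i,j} |A_{ij}|^2$, we have

$$P\big(\|n^{-1}\mathbb{X} - \Gamma_p\| > \epsilon\big) \le \frac{1}{\epsilon^2} \sum_{i,j=1}^{p} d_{ij}, \qquad (29)$$
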

where $d_{ij} = E(n^{-1}\mathbb{X}_{ij} - \gamma(i-j))^2$. We shall make use of Wick's formula. This formula gives the expectation of a product of several centered (jointly) Gaussian variables $G_1, \ldots, G_k$, in terms of the elements of their covariance matrix $C = (c_{ij})$:

$$E \prod_{i=1}^{k} G_i = \sum c_{i_1 i_2} \cdots c_{i_{k-1} i_k},$$

for $k = 2m$, and zero otherwise. The sum extends over all different partitions of $\{G_1, \ldots, G_{2m}\}$ into $m$ pairs. Applying the formula, we obtain:

$$\begin{aligned}
E\mathbb{X}_{ij}^2 &= \sum_{s,t=1-i}^{n-i} E X_t X_{t+i-j} X_s X_{s+i-j} \\
&= \sum_{s,t=1-i}^{n-i} \big( \gamma^2(i-j) + \gamma^2(s-t) + \gamma(s-t+i-j)\gamma(-(s-t)+i-j) \big),
\end{aligned}$$

where we have used the equivalent representation $\mathbb{X}_{ij} = \sum_{t=1-i}^{n-i} X_t X_{t+i-j}$.
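To make the pairing sum in Wick's formula concrete, here is a small sketch that enumerates all pair partitions and evaluates the corresponding product of covariances. The helper names `pairings` and `wick_moment`, and the example covariance matrix, are hypothetical and only serve as illustration.

```python
def pairings(items):
    """Yield every partition of `items` (even length) into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k in range(len(rest)):
        pair = (first, rest[k])
        for tail in pairings(rest[:k] + rest[k + 1:]):
            yield [pair] + tail

def wick_moment(cov, idx):
    """E[G_{idx[0]} * ... * G_{idx[-1]}] for centered jointly Gaussian G with covariance `cov`."""
    if len(idx) % 2 == 1:
        return 0.0                      # odd moments vanish
    total = 0.0
    for partition in pairings(list(idx)):
        term = 1.0
        for a, b in partition:
            term *= cov[a][b]
        total += term
    return total

cov = [[2.0, 0.5], [0.5, 1.0]]          # illustrative covariance matrix
print(wick_moment(cov, [0, 0, 0, 0]))   # E G_0^4 = 3 * Var(G_0)^2
print(wick_moment(cov, [0, 0, 1, 1]))   # c00*c11 + 2*c01^2
```

With four variables there are three pairings, which is exactly how the three terms $\gamma^2(i-j)$, $\gamma^2(s-t)$ and $\gamma(s-t+i-j)\gamma(-(s-t)+i-j)$ arise in the expression for $E\mathbb{X}_{ij}^2$.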
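A change of variables $k = s - t$ shows that

$$\sum_{s,t=1-i}^{n-i} \big( \gamma^2(s-t) + \gamma(s-t+i-j)\gamma(-(s-t)+i-j) \big) = n\big[\gamma^2(0) + \gamma^2(i-j)\big] + 2\sum_{k=1}^{n-1} (n-k)\big[\gamma^2(k) + \gamma(k+i-j)\gamma(-k+i-j)\big].$$

Therefore,

$$\frac{E\mathbb{X}_{ij}^2}{n^2} = \gamma^2(i-j) + \frac{1}{n}\big[\gamma^2(0) + \gamma^2(i-j)\big] + \frac{2}{n^2}\sum_{k=1}^{n-1} (n-k)\big[\gamma^2(k) + \gamma(k+i-j)\gamma(-k+i-j)\big]. \qquad (30)$$
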

Notice that $\sum_{k=1}^{\infty} |\gamma^2(k) + \gamma(k+i-j)\gamma(-k+i-j)| < \infty$. This may be seen by using the expression for the autocovariance function, $\gamma(h) = \sigma^2 \sum_{j=0}^{\infty} \psi_j\psi_{j+|h|}$, and by utilizing the summability of the $\psi_j$'s, $\sum_{j=0}^{\infty} |\psi_j| < \infty$. The expression (30) therefore differs from $\gamma^2(i-j)$ by a term of order $O(1/n)$. Since $E(n^{-1}\mathbb{X}_{ij}) = \gamma(i-j)$, this, in turn, shows that $d_{ij} = O(1/n)$, uniformly for every $i, j$. The proof is completed by recalling the RHS of (29), which is of the order of magnitude of $O(p^2/n)$. □
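The change of variables and the resulting $O(1/n)$ rate for $d_{ij}$ can be illustrated numerically. The sketch below assumes, purely for illustration, the autocovariance $\gamma(h) = \varphi^{|h|}$ of a unit-variance AR(1) process with $\varphi = 0.5$; the function names are hypothetical. It compares the double sum over $(s,t)$ with the single sum over $k$, and then evaluates $n\,d_{ij}$ for a moderately large $n$.

```python
def gamma(h, phi=0.5):
    """Illustrative autocovariance of a unit-variance AR(1) process."""
    return phi ** abs(h)

def second_moment_double_sum(n, d):
    """E X_ij^2 as a double sum over s, t (d = i - j), following Wick's formula."""
    total = 0.0
    for s in range(n):
        for t in range(n):
            k = s - t
            total += gamma(d) ** 2 + gamma(k) ** 2 + gamma(k + d) * gamma(-k + d)
    return total

def second_moment_single_sum(n, d):
    """Same quantity after the change of variables k = s - t."""
    total = n * n * gamma(d) ** 2 + n * (gamma(0) ** 2 + gamma(d) ** 2)
    total += 2 * sum((n - k) * (gamma(k) ** 2 + gamma(k + d) * gamma(-k + d))
                     for k in range(1, n))
    return total

def d_ij(n, d):
    """d_ij = E(X_ij / n - gamma(d))^2 = E X_ij^2 / n^2 - gamma(d)^2."""
    return second_moment_single_sum(n, d) / n ** 2 - gamma(d) ** 2

print(second_moment_double_sum(30, 2), second_moment_single_sum(30, 2))
print(200 * d_ij(200, 2))   # stays bounded as n grows
```

Summing $d_{ij} = O(1/n)$ over the $p^2$ index pairs gives the $O(p^2/n)$ order quoted for the right-hand side of (29).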
Using Lemma 6.2 we obtain

$$\big| n a_n^2 u'(n^{-1}\mathbb{X} - \Gamma_p)u/2 \big| \le o_P(1)\, n a_n^2 \|u\|^2. \qquad (31)$$

We complete the argument with a bound on the term associated with the penalties, $-n\lambda_n \sum_{j \in S} \lambda_{n,j}\big[|\phi_j^* + a_n u_j| - |\phi_j^*|\big]$. Applying the Cauchy-Schwarz inequality, along with the fact that $\|a\|_1 \le \sqrt{p}\,\|a\|_2$ for every $a \in \mathbb{R}^p$, it is clear that the above term is absolutely bounded by $\lambda_n \|\lambda_{n,S}\|_\infty \sqrt{s}\, n a_n \|u\|$. Now, since the second term in $I_2$ (see (27)) dominates the other terms, the proof of the theorem is completed.

□ 

Proof of Theorem 3.3. We begin as in Bunea et al. (2007b). Recall that $\|a\|_A^2$ stands for $a'Aa$, for every $p$-dimensional vector $a$, and $p \times p$ symmetric matrix $A$. We proceed by stating and proving two lemmas.

Lemma 6.3. Let assumptions (i) and (ii) of Theorem 3.3 be in effect. Then,

$$\|\hat{\phi}_n - \phi^*\|_{\mathbb{X}/n}^2 \le 4\lambda_n M (s\kappa_p^{-1})^{1/2} \|\hat{\phi}_n - \phi^*\|_{\Gamma_p} \qquad (32)$$

holds true on

$$\mathcal{J}_1 = \Big\{ \frac{2}{n}\Big|\sum_{t=1}^{n} X_{t-j}Z_t\Big| \le \lambda_n\lambda_{n,j}, \ \text{for all } j = 1, \ldots, p \Big\}. \qquad (33)$$


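Proof. By definition, the Lasso estimator $\hat{\phi}_n$ satisfies (see (12)),

$$n^{-1}\|y - X\hat{\phi}_n\|^2 + 2\lambda_n \sum_{j=1}^{p} \lambda_{n,j}|\hat{\phi}_{n,j}| \le n^{-1}\|y - X\phi^*\|^2 + 2\lambda_n \sum_{j=1}^{p} \lambda_{n,j}|\phi_j^*|.$$

Recalling the model $y = X\phi^* + Z$, we obtain, by rearranging the above terms,

$$\|\hat{\phi}_n - \phi^*\|_{\mathbb{X}/n}^2 + 2\lambda_n \sum_{j=1}^{p} \lambda_{n,j}|\hat{\phi}_{n,j}| \le 2(\hat{\phi}_n - \phi^*)'\frac{1}{n}X'Z + 2\lambda_n \sum_{j=1}^{p} \lambda_{n,j}|\phi_j^*|.$$

Now, since $(\hat{\phi}_n - \phi^*)'\frac{1}{n}X'Z = \sum_{j=1}^{p} (\hat{\phi}_{n,j} - \phi_j^*)\frac{1}{n}\sum_{t=1}^{n} X_{t-j}Z_t$, we have, on $\mathcal{J}_1$,

$$\begin{aligned}
\|\hat{\phi}_n - \phi^*\|_{\mathbb{X}/n}^2 &\le \lambda_n \sum_{j=1}^{p} \lambda_{n,j}|\hat{\phi}_{n,j} - \phi_j^*| + 2\lambda_n \sum_{j=1}^{p} \lambda_{n,j}\big(|\phi_j^*| - |\hat{\phi}_{n,j}|\big) \\
&\le 4\lambda_n \sum_{j \in S} \lambda_{n,j}|\hat{\phi}_{n,j} - \phi_j^*|, \qquad (34)
\end{aligned}$$

where the second inequality is obtained by decomposing the summation $\sum_{j=1}^{p}$ into $\sum_{j \in S} + \sum_{j \notin S}$, and using the Cauchy-Schwarz inequality.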

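By assumption (ii), and the fact that $\gamma(0) = E|X_t|^2 = 1$, we have

$$\sum_{j \in S} |\hat{\phi}_{n,j} - \phi_j^*|^2 \le \sum_{j=1}^{p} (\hat{\phi}_{n,j} - \phi_j^*)^2 = \|\hat{\phi}_n - \phi^*\|_{\mathrm{diag}(\Gamma_p)}^2 \le \frac{1}{\kappa_p}\|\hat{\phi}_n - \phi^*\|_{\Gamma_p}^2. \qquad (35)$$

The proof is completed by applying the Cauchy-Schwarz inequality on (34), and by using assumption (i).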

□ 

We turn to the second lemma. 
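Lemma 6.4. Let assumptions (i), (ii) of Theorem 3.3 be in effect. Let $C$ be a constant (given explicitly in the proof) depending on $M$ only. Put $\epsilon = \lambda_n(sp^{-1})^{1/2}$. Then,

$$\|\hat{\phi}_n - \phi^*\|_{\Gamma_p}^2 \le C\lambda_n^2 s\kappa_p^{-1} \qquad (36)$$

holds true on $\mathcal{J}_1 \cap \mathcal{J}_2$, where $\mathcal{J}_1$ is given by (33), and

$$\mathcal{J}_2 = \{M_p \le \epsilon\}, \qquad (37)$$

with

$$M_p = \max_{1 \le i,j \le p} \Big| \frac{\mathbb{X}_{ij}}{n} - \gamma(i-j) \Big|. \qquad (38)$$

Proof. Note that

$$\cdots \le M_p.$$

Therefore,

$$\|\hat{\phi}_n - \phi^*\|_{\mathbb{X}/n}^2 \ge \cdots \ge \|\hat{\phi}_n - \phi^*\|_{\Gamma_p}^2 - M_p(p\kappa_p^{-1})^{1/2}\|\hat{\phi}_n - \phi^*\|_{\Gamma_p}.$$
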

The first inequality follows since $\|a\|_1 \le \sqrt{p}\,\|a\|_2$, and the second inequality is satisfied under assumption (ii) (see (35)). Referring back to (32), we obtain, on $\mathcal{J}_1 \cap \mathcal{J}_2$,

$$\|\hat{\phi}_n - \phi^*\|_{\Gamma_p}^2 \le 2(1/2 + 2M)\lambda_n(s\kappa_p^{-1})^{1/2}\|\hat{\phi}_n - \phi^*\|_{\Gamma_p}.$$

Applying the inequality $2xy \le 2x^2 + y^2/2$ on the right-hand side of the expression above (with $x = (1/2 + 2M)\lambda_n(s\kappa_p^{-1})^{1/2}$, and $y = \|\hat{\phi}_n - \phi^*\|_{\Gamma_p}$), we establish the statement of the Lemma, with $C = 4(1/2 + 2M)^2$.

□ 

The rest of the proof of Theorem 3.3 is devoted to showing that indeed $\|\hat{\phi}_n - \phi^*\|_{\Gamma_p}^2 \le C\lambda_n^2 s\kappa_p^{-1}$ fails only on a negligible event, i.e., that the probability of the complement of $\mathcal{J}_1 \cap \mathcal{J}_2$ is negligible. We shall commence with $\mathcal{J}_2$.

We recall here the family of time series $\{X_t\}$, denoted by $\mathcal{H}_\rho(l, L)$, for some $\rho > 1$, $0 < l < 1$, and $L > 1$ (Section 3.2). The family consists of all stationary Gaussian time series with $EX_t = 0$, $E|X_t|^2 = 1$, and enjoys an exponential decay of the strongly mixing coefficients (see (8)).

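Lemma 6.5. Assume that $\epsilon = \lambda_n(s/p)^{1/2} \le Dn^{-2/5}$, where $D = (C_1^3 C_2 \beta_2^3\beta_1^2)^{1/5}$, with $C_1$ and $C_2$ two constants explicitly specified in the proof. Then,

$$P(\mathcal{J}_2^c) \le p^2 \exp\big\{ -n\lambda_n^2(s/p^2)/(4C_1\beta_1\beta_2) \big\}.$$

Proof. We begin with

$$\frac{\mathbb{X}_{ij}}{n} - \gamma(i-j) = \sum_{t=1-i}^{n-i} Y_t,$$

where

$$Y_t = Y_{t,i,j} = \frac{1}{n}\big(X_tX_{t+i-j} - \gamma(i-j)\big). \qquad (39)$$

The proof is based on an application of the pair of Lemmas 6.6 and 6.7, after noticing that

$$P(\mathcal{J}_2^c) = P(M_p > \epsilon) \le \sum_{i,j=1}^{p} P\Big( \Big|\sum_{t=1-i}^{n-i} Y_t\Big| > \epsilon \Big).$$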
Define $k = i - j$. It is enough to consider only $k \ge 0$ ($i \ge j$), since $\mathbb{X}_{ij}$ and $\gamma(i-j)$ are symmetric. By the same argument below expression (39) in Goldenshluger and Zeevi (2001), one may notice that $\{Y_t\}$ is strongly mixing with the rate $\alpha_Y(m) \le \alpha_X(m-k)$ for $m > k$, and $\alpha_Y(m) \le 1/4$ (see Bradley (2005)), but for our purposes it would be enough to bound $\alpha_Y(m)$, for $m \le k$, by simply 1.

We shall make use of the following two lemmas, adapted from Goldenshluger and Zeevi (2001).

Lemma 6.6. Suppose $\{X_t\}$ is a strongly mixing time series, $S_n = \sum_{t=1}^{n} X_t$, and $\mathrm{cum}_r(S_n)$ is the $r$th order cumulant of $S_n$. For $\nu > 0$ define the function

$$\Lambda_n(\alpha_X, \nu) = \max\Big\{ 1, \sum_{m=1}^{n} (\alpha_X(m))^{1/\nu} \Big\}.$$

If, for some $\mu \ge 0$, $H > 0$,

$$E|X_t|^r \le (r!)^{\mu+1}H^r, \quad t = 1, \ldots, n, \ r = 2, 3, \ldots,$$

then

$$|\mathrm{cum}_r(S_n)| \le 2^{r(1+\mu)+1}\,12^{r-1}(r!)^{2+\mu}H^r\big[\Lambda_n(\alpha_X, 2(r-1))\big]^{r-1} n.$$

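Lemma 6.7. Let $Y$ be a random variable with $EY = 0$. If there exist $\mu_1 \ge 0$, $H_1 > 0$ and $\Delta > 0$ such that

$$|\mathrm{cum}_r(Y)| \le \Big(\frac{r!}{2}\Big)^{1+\mu_1} \frac{H_1}{\Delta^{r-2}}, \quad r = 2, 3, \ldots,$$

then

$$P(|Y| \ge y) \le \begin{cases} \exp\{-y^2/(4H_1)\}, & 0 \le y \le (H_1^{1+\mu_1}\Delta)^{1/(2\mu_1+1)}, \\ \exp\{-(y\Delta)^{1/(1+\mu_1)}/4\}, & y \ge (H_1^{1+\mu_1}\Delta)^{1/(2\mu_1+1)}. \end{cases}$$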
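The two regimes of the tail bound in Lemma 6.7 can be coded directly. The sketch below follows our reading of the lemma (a Gaussian-type bound for moderate deviations, a stretched-exponential bound for large deviations); the function names and the numerical values of `H1`, `Delta` and `mu1` are illustrative assumptions. It also checks that the two branches agree at the switching point, so the bound is continuous in $y$.

```python
import math

def regime_threshold(H1, Delta, mu1):
    """Switching point (H1^(1+mu1) * Delta)^(1/(2*mu1+1)) between the two regimes."""
    return (H1 ** (1 + mu1) * Delta) ** (1.0 / (2 * mu1 + 1))

def cumulant_tail_bound(y, H1, Delta, mu1):
    """Upper bound on P(|Y| >= y) under the cumulant condition, as read from Lemma 6.7."""
    if y <= regime_threshold(H1, Delta, mu1):
        return math.exp(-y * y / (4 * H1))                          # moderate deviations
    return math.exp(-((y * Delta) ** (1.0 / (1 + mu1))) / 4)        # large deviations

# Illustrative (hypothetical) constants; mu1 = 2 is the value used in the proof of Lemma 6.5.
H1, Delta, mu1 = 0.02, 50.0, 2
thr = regime_threshold(H1, Delta, mu1)
moderate_at_thr = math.exp(-thr ** 2 / (4 * H1))
large_at_thr = math.exp(-((thr * Delta) ** (1.0 / (1 + mu1))) / 4)

print(thr, moderate_at_thr, large_at_thr)
```

With $\mu_1 = 2$ the large-deviation exponent is $(y\Delta)^{1/3}/4$, which is the form that enters (40).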

Back to the proof of Lemma 6.5. Absolute moments of $Y_t$ are bounded as follows:

$$\begin{aligned}
E|Y_t|^r &\le n^{-r}2^{r-1}\big[ E|X_tX_{t+k}|^r + |\gamma(k)|^r \big] \\
&\le n^{-r}2^{r-1}\big[ (E|X_t|^{2r}E|X_{t+k}|^{2r})^{1/2} + \gamma(0)^r \big] \\
&\le r!(4/n)^r.
\end{aligned}$$

The first two inequalities follow by the inequality $(a+b)^j \le 2^{j-1}(a^j + b^j)$ together with the Cauchy-Schwarz inequality, and the last inequality follows by the assumed Gaussianity of $X_t$, and the inequality $\binom{2r}{r} \le 2^{2r}$.
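The last step of the moment bound can be verified with exact arithmetic. For a standard Gaussian variable, $E|X|^{2r} = (2r)!/(2^r r!)$, which is a known identity; the sketch below combines it with $\gamma(0) = 1$ and checks that the middle expression in the chain is indeed dominated by $r!(4/n)^r$. The function names and the choice $n = 10$ are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, factorial

def gaussian_even_moment(r):
    """E X^(2r) for a standard normal X, equal to (2r)! / (2^r r!)."""
    return Fraction(factorial(2 * r), 2 ** r * factorial(r))

def moment_chain(r, n):
    """Return (middle bound, final bound) for E|Y_t|^r, using stationarity and gamma(0) = 1."""
    middle = Fraction(2 ** (r - 1), n ** r) * (gaussian_even_moment(r) + 1)
    final = factorial(r) * Fraction(4, n) ** r
    return middle, final

for r in range(2, 7):
    middle, final = moment_chain(r, 10)
    print(r, float(middle), float(final), middle <= final)
```

The comparison relies on $(2r)!/(2^r r!) = \binom{2r}{r} r!/2^r \le 2^r r!$, which is where the inequality $\binom{2r}{r} \le 2^{2r}$ is used.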
We have 

n / or; \ VC^-i) n ~ k 

E(^(-)) 1/2M) < ^(it^tt) E^ m/2(r - 1} 

?n = l ^ m=l 



i{p-i)) V 1o sp 

The first inequality utilizes the relationship between $\alpha_Y(m)$ and $\alpha_X(m)$, and inequality (8). The second inequality uses the geometric series expression together with the inequality $\rho^x - 1 \ge x\log\rho$, for all $x \ge 0$.

Therefore, defining $\tilde{k} = k$ if $k > 0$, and $\tilde{k} = 1$ if $k = 0$, we obtain, after some manipulations, similar to those in Goldenshluger and Zeevi (2001),

$$\big[\Lambda_n(\alpha_X, 2(r-1))\big]^{r-1} \le 12^{r-1}\, r!\,(\tilde{k}\beta_1)^{r-1}\beta_2,$$

for two constants $\beta_1$ and $\beta_2$, given, respectively, by $1 + 1/\log\rho$ and $1 + L\rho/(l(\rho-1))$ (see (9)). The bound results from the inequalities $(a+b)^j \le 2^{j-1}(a^j + b^j)$, $n^n \le n!e^n$, and other trivial inequalities.
Applying Lemma 6.6 (with $\mu = 0$ and $H = 4/n$) we have $|\mathrm{cum}_r(\sum_{t=1-i}^{n-i} Y_t)| \le \mathrm{RHS}$, where RHS can be put in the form $(r!/2)^3 H_1 \Delta^{2-r}$, with $H_1 = C_1\beta_2(\tilde{k}\beta_1/n)$, $\Delta = C_2(\tilde{k}\beta_1/n)^{-1}$, and $C_1 = 2^{10}12^2$, $C_2 = 2^{-3}12^{-2}$. Now, applying Lemma 6.7 (with $\mu_1 = 2$, and $H_1$ and $\Delta$ as above) we obtain:

$$P\Big( \Big|\sum_{t=1-i}^{n-i} Y_t\Big| > y \Big) \le \begin{cases} \exp\big\{ -y^2 n/(4C_1\tilde{k}\beta_1\beta_2) \big\}, & 0 \le y \le D\tilde{k}^{2/5}n^{-2/5}, \\ \exp\big\{ -\tfrac{1}{4}\big(C_2 n y/(\tilde{k}\beta_1)\big)^{1/3} \big\}, & y \ge D\tilde{k}^{2/5}n^{-2/5}, \end{cases} \qquad (40)$$

where $D = (C_1^3 C_2 \beta_2^3\beta_1^2)^{1/5}$. The proof is completed by applying the moderate deviation part in (40) with $y = \epsilon$, and by noticing that $1 \le \tilde{k} \le p$.

□ 
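The bookkeeping of constants in the proof of Lemma 6.5 can be checked with exact rational arithmetic. The sketch below assumes our reading of the intermediate bound, namely $[\Lambda_n]^{r-1} \le 12^{r-1} r! (\tilde{k}\beta_1)^{r-1}\beta_2$, plugs it into the cumulant bound of Lemma 6.6 with $\mu = 0$ and $H = 4/n$, and compares the result with the form $(r!/2)^3 H_1 \Delta^{2-r}$ for $C_1 = 2^{10}12^2$ and $C_2 = 2^{-3}12^{-2}$. The function names and the values of `n`, `k`, `beta1`, `beta2` are hypothetical.

```python
from fractions import Fraction
from math import factorial

C1 = Fraction(2 ** 10 * 12 ** 2)
C2 = Fraction(1, 2 ** 3 * 12 ** 2)

def lemma_66_bound(r, n, k, beta1, beta2):
    """Cumulant bound of Lemma 6.6 (mu = 0, H = 4/n) with the assumed bound on Lambda_n^(r-1)."""
    H = Fraction(4, n)
    lambda_power = 12 ** (r - 1) * factorial(r) * (k * beta1) ** (r - 1) * beta2
    return 2 ** (r + 1) * 12 ** (r - 1) * factorial(r) ** 2 * H ** r * lambda_power * n

def lemma_67_form(r, n, k, beta1, beta2):
    """The target form (r!/2)^3 * H1 * Delta^(2-r) required by Lemma 6.7 with mu1 = 2."""
    H1 = C1 * beta2 * k * beta1 / n
    Delta = C2 * n / (k * beta1)
    return Fraction(factorial(r), 2) ** 3 * H1 * Delta ** (2 - r)

n, k = 50, 3
beta1, beta2 = Fraction(5, 2), Fraction(7, 3)
for r in range(2, 6):
    print(r, lemma_66_bound(r, n, k, beta1, beta2) == lemma_67_form(r, n, k, beta1, beta2))
```

Under these assumptions the two expressions coincide for every $r$, which is consistent with the stated values of $C_1$ and $C_2$.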

We turn to evaluate the probability of the complement of the event $\mathcal{J}_1$.

Lemma 6.8. For all $0 < c < \infty$ and $y > \sigma^2(n + Dn^{3/5})$ (where $D$ is given by (9)),

$$P(\mathcal{J}_1^c) \le 6p \exp\Big( -F_1 \min\Big\{ (\sigma^{-2}y - n)^{1/3},\ c^2\sigma^{-2},\ \frac{n^2\lambda_n^2\lambda_{\min}^2}{y + cn\lambda_n\lambda_{\max}/2} \Big\} \Big),$$

where $F_1 = \min\{ (C_2/\beta_1)^{1/3}/4,\ 2^{-9},\ 8^{-1} \}$.

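Proof. Let $V_n^2 = \sigma^2\sum_{t=1}^{n} X_{t-j}^2 = \sigma^2\sum_{t=1-j}^{n-j} X_t^2$. Fix a $y > \sigma^2(n + Dn^{3/5})$ and a $0 < c < \infty$. Denote by $\tilde{\mathcal{J}}_1$ the event $\mathcal{J}_1$ (see (33)) with the absolute value removed. We begin by writing:

$$\begin{aligned}
P(\tilde{\mathcal{J}}_1^c) &\le \sum_{j=1}^{p} P\Big( \frac{2}{n}\sum_{t=1}^{n} X_{t-j}Z_t > \lambda_n\lambda_{n,j} \Big) \\
&\le \sum_{j=1}^{p} P\Big( \bigcup_{n=1}^{\infty} \Big\{ \frac{2}{n}\sum_{t=1}^{n} X_{t-j}Z_t > \lambda_n\lambda_{n,j},\ V_n^2 \le y \Big\} \Big) + pP(V_n^2 > y) \\
&=: I_1 + I_2.
\end{aligned}$$

Clearly $I_1$ satisfies $I_1 \le I_{11} + I_{12}$, with

$$I_{11} = \sum_{j=1}^{p} P\Big( \bigcup_{n=1}^{\infty} \Big\{ \frac{2}{n}\sum_{t=1}^{n} X_{t-j}Z_t > \lambda_n\lambda_{n,j},\ V_n^2 \le y \Big\} \cap \bigcap_{r=3}^{\infty} \Big\{ |X_{t-j}|^{r-2}E|Z_t|^r \le \frac{r!}{2}\sigma^2c^{r-2} \Big\} \Big),$$

$$I_{12} = \sum_{j=1}^{p} P\Big( \bigcup_{r=3}^{\infty} \Big\{ |X_{t-j}|^{r-2}E|Z_t|^r > \frac{r!}{2}\sigma^2c^{r-2} \Big\} \Big).$$

We analyze $P(\tilde{\mathcal{J}}_1^c)$ by investigating $I_{11}$, $I_{12}$ and $I_2$ separately.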

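For $I_2$, we recall that $Y_t = Y_{t,j,j} = (X_t^2 - \gamma(0))/n$ (see (39) and the remark below) is strongly mixing with exponential decay rate. Therefore, by the large deviation part in (40) (with $\tilde{k} = 1$),

$$\begin{aligned}
P(V_n^2 > y) &\le P(|V_n^2 - n\sigma^2| > y - n\sigma^2) \\
&= P\Big( \Big|\sum_{t=1-j}^{n-j} Y_t\Big| > \sigma^{-2}n^{-1}y - 1 \Big) \\
&\le \exp\Big\{ -\frac{1}{4}\Big(\frac{C_2}{\beta_1}\Big)^{1/3}(\sigma^{-2}y - n)^{1/3} \Big\}.
\end{aligned}$$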

For $I_{12}$, we use the bound $E|Z_t|^{2r} \le \sigma^{2r}r!\,2^{2r}$ (and the Cauchy-Schwarz inequality) to obtain

$$\Big\{ |X_{t-j}|^{r-2}E|Z_t|^r > \frac{r!}{2}\sigma^2c^{r-2} \Big\} \subset \big\{ |X_{t-j}| > 2^{-(1+r)/(r-2)}\sigma^{-1}c \big\}.$$

Therefore, noticing that $\{2^{-(1+r)/(r-2)}\}_{r \ge 3}$ is an increasing sequence, we have

$$I_{12} \le \sum_{j=1}^{p} P\big( |X_{t-j}| > 2^{-4}\sigma^{-1}c \big) \le (2/\pi)^{1/2}\, p \exp\{ -2^{-8}c^2/2\sigma^2 \}.$$

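For $I_{11}$, we use the following theorem, which is a Bernstein-type inequality for martingales.

Theorem 6.9 (de la Peña (1999)). Let $\{M_n, \mathcal{F}_n\}$ be a martingale, with differences $\Delta_n = M_n - M_{n-1}$. Define $V_n^2 = \sum_{i=1}^{n} \sigma_i^2 = \sum_{i=1}^{n} E(\Delta_i^2 \mid \mathcal{F}_{i-1})$. Assume that $E(|\Delta_i|^r \mid \mathcal{F}_{i-1}) \le (r!/2)\sigma_i^2c^{r-2}$ a.e. for $r \ge 3$, $0 < c < \infty$. Then, for all $x, y > 0$,

$$P\Big( \bigcup_{n=1}^{\infty} \{ M_n > x,\ V_n^2 \le y \} \Big) \le \exp\Big\{ -\frac{x^2}{2(y + cx)} \Big\}. \qquad (41)$$

Recall that $\sum_{t=1}^{n} X_{t-j}Z_t$ is a martingale (see (18)). Then, a simple application of the above theorem, with $x = n\lambda_n\lambda_{n,j}/2$, leads to

$$I_{11} \le p \exp\Big\{ -\frac{n^2\lambda_n^2\lambda_{\min}^2}{8(y + cn\lambda_n\lambda_{\max}/2)} \Big\}.$$

Lemma 6.8 now follows by collecting the bounds of $I_{11}$, $I_{12}$ and $I_2$, and by symmetry.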

□ 
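The exponent obtained from (41) after the substitution $x = n\lambda_n\lambda_{n,j}/2$ can be checked numerically. The sketch below implements the right-hand side of (41) as we read it, $\exp\{-x^2/(2(y + cx))\}$, and compares it with the closed form $\exp\{-n^2\lambda_n^2\lambda_{n,j}^2/(8(y + cn\lambda_n\lambda_{n,j}/2))\}$. The function names and the numerical values are illustrative assumptions.

```python
import math

def martingale_bernstein(x, y, c):
    """Right-hand side of (41): exp(-x^2 / (2 (y + c x)))."""
    return math.exp(-x * x / (2 * (y + c * x)))

def i11_term(n, lam_n, lam_nj, y, c):
    """Same bound written out for x = n * lam_n * lam_nj / 2."""
    a = n * lam_n * lam_nj
    return math.exp(-a * a / (8 * (y + c * a / 2)))

# Illustrative (hypothetical) values.
n, lam_n, lam_nj, y, c = 500, 0.1, 1.2, 650.0, 3.0
x = n * lam_n * lam_nj / 2

print(martingale_bernstein(x, y, c), i11_term(n, lam_n, lam_nj, y, c))
```

Replacing $\lambda_{n,j}$ by its smallest value in the numerator and by its largest value in the denominator gives a bound that holds uniformly in $j$, which is how the factor $p$ arises after summing over $j = 1, \ldots, p$.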

The proof of Theorem 3.3 is now complete by virtue of Lemma 6.3, Lemma 6.4, Lemma 6.5, and Lemma 6.8.

□ 